Red Wine Quality Exploration by Piyush Goyal

Univariate Plots Section

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity   : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity: num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid     : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar  : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides       : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ Free_SO2        : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ Total_SO2       : num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density         : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH              : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates       : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol         : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality         : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.cut     : Factor w/ 3 levels "(0,4]","(4,6]",..: 2 2 2 2 2 2 2 3 3 2 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides          Free_SO2       Total_SO2         density      
##  Min.   :0.01200   Min.   : 1.00   Min.   :  6.00   Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00   1st Qu.: 22.00   1st Qu.:0.9956  
##  Median :0.07900   Median :14.00   Median : 38.00   Median :0.9968  
##  Mean   :0.08747   Mean   :15.87   Mean   : 46.47   Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00   3rd Qu.: 62.00   3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00   Max.   :289.00   Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000  
##  quality.cut  
##  (0,4] :  63  
##  (4,6] :1319  
##  (6,10]: 217  
##               
##               
## 

Let's first plot a histogram of fixed acidity.

Fixed acidity appears to be approximately normally distributed, with most samples falling between 6.5 g/dm3 and 9.2 g/dm3.

Volatile acidity appears to follow a bimodal distribution, with most samples falling between 0.25 g/dm3 and 0.79 g/dm3. On a log scale, however, the distribution looks approximately normal.
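As a rough illustration of the log transformation mentioned above, the sketch below compares skewness before and after taking log10. It uses simulated right-skewed values as a stand-in for volatile.acidity (an assumption, not the wine data), and the `skewness` helper is a hypothetical name defined here.

```r
# Illustrative only: simulated right-skewed values stand in for volatile.acidity.
set.seed(1)
volatile_acidity <- rlnorm(500, meanlog = log(0.5), sdlog = 0.35)

# Sample skewness: third standardized moment (close to 0 for a symmetric sample).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

raw_skew <- skewness(volatile_acidity)
log_skew <- skewness(log10(volatile_acidity))

# hist(..., plot = FALSE) returns the bin counts without drawing anything.
raw_hist <- hist(volatile_acidity, breaks = 20, plot = FALSE)
log_hist <- hist(log10(volatile_acidity), breaks = 20, plot = FALSE)

c(raw = raw_skew, log10 = log_skew)
```

A skewness closer to zero after the transformation is one simple sign that the log scale gives a more symmetric, bell-like shape.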

From the plots above, the following observations are made:

Quality ranges from 3 to 8. Most wines are of medium quality (5 - 6), and only a small percentage are of good quality.

The plot above also shows that most of the wines fall in the (4,6] quality range.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wines in this data set with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). The two sulfur dioxide features appear as Free_SO2 and Total_SO2 in the output above.

The following observations are made from the data set:

  • Sulfur dioxide (both free and total) is distributed over a wide range across the samples.
  • The alcohol content varies from 8.40 to 14.90.
  • The quality of the samples ranges from 3 to 8, with 6 being the median.
  • The range for fixed acidity is quite wide, with a minimum of 4.6 and a maximum of 15.9.
  • The pH value varies from 2.740 to 4.010, with a median of 3.310.

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are alcohol, quality and quality.cut. I’d like to determine which features are best for predicting the quality of wine. I think that, along with alcohol, the quantity of SO2 (free and total) and acidity (both fixed and volatile) might be used in a predictive model of wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

SO2 (free and total), acidity (both fixed and volatile) and density are likely to contribute to the quality of wine.

Did you create any new variables from existing variables in the dataset?

Yes, quality.cut is a variable added to the dataset. It divides the samples into 3 quality bins: (0,4], (4,6] and (6,10].
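A minimal sketch of how such a binned variable can be built with base R's `cut()`. The exact call used for this report is not shown, so the break points below are an assumption inferred from the bin labels; the ten quality scores are the first ten values from the str() output above.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# cut() builds right-closed intervals by default,
# so a score of 4 falls in (0,4] and a score of 6 falls in (4,6].
quality_cut <- cut(quality, breaks = c(0, 4, 6, 10))

levels(quality_cut)       # the three bin labels
as.integer(quality_cut)   # bin index of each sample
table(quality_cut)        # number of samples per bin
```

The bin indices for these ten rows agree with the codes printed for quality.cut in the str() output.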

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

According to the plots above, there are outliers in some of the features, such as SO2 (free and total) and acidity (fixed and volatile). Also, the distribution of volatile acidity appears to be bimodal, but on a log scale it looks approximately normal.

Bivariate Plots Section

Let's run a scatterplot matrix and see how the features correlate with each other.

The scatterplot matrix shows the following behaviour:

Fixed Acidity

  • It shows a positive correlation with citric acid, which is expected since citric acid is one of the fixed acids. It also shows a positive correlation with density.
  • It also shows a significant negative correlation with pH and volatile acidity.

Volatile Acidity

  • It is strongly negatively correlated with citric acid and quality.

Free SO2

  • It shows a significant positive correlation with total SO2, and very little correlation with sulphates.

Density

  • A significant negative correlation is observed with alcohol and pH, and a positive correlation with acidity (fixed and citric acid).

Quality Cut

  • Most of the data seems to be clustered in the range (4,6].
  • Outliers are observed, which we will discuss in further analysis.

Quality

  • Quality is positively correlated with alcohol and negatively correlated with volatile acidity.

Also, from the scatterplot matrix above, chlorides and sulphates don’t seem to have any effect on quality.
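The correlations read off the scatterplot matrix can also be computed numerically with `cor()`. The sketch below is only illustrative: it uses the ten rows visible in the str() output above, which are far too few to reproduce the full-data coefficients, and the data frame name `wine_head` is an assumption.

```r
# First ten rows of four features, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlation coefficients, rounded for readability.
cor_matrix <- round(cor(wine_head, method = "pearson"), 2)
cor_matrix
```

On the full data set the same call would be applied to all numeric columns; a positive entry means the two features rise together, a negative entry means one falls as the other rises.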

The positive correlation between alcohol and quality is summarized below:

There seems to be no significant bias in the alcohol content, even though quality levels 3 and 5 exhibit higher density readings in the plot and some samples have a higher alcohol content.

The negative correlation between volatile acidity and quality is summarized below:

It seems that quality levels 5, 7 and 8 exhibit higher density readings in the plot for volatile acidity.

##   quality Mean_Volatile_Acidity Variance_Volatile_Acidity Standard_Deviation_Volatile_Acidity
## 1       3             0.8845000                0.10973028                           0.3312556
## 2       4             0.6939623                0.04844842                           0.2201100
## 3       5             0.5770411                0.02715943                           0.1648012
## 4       6             0.4974843                0.02590885                           0.1609623
## 5       7             0.4039196                0.02109011                           0.1452244
## 6       8             0.4233333                0.02100000                           0.1449138

Even though quality levels 5, 7 and 8 exhibit higher density readings for volatile acidity, their means are lower than those of quality levels 3 and 4. We also observe that as quality increases, the variance and standard deviation decrease, and the mean decreases as well, except for a slight rise from quality 7 to quality 8.
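A per-quality summary like the one above can be sketched with `split()` and `sapply()`. This is an assumed reconstruction, not necessarily the code behind the report, and it runs only on the ten rows visible in the str() output, so the numbers differ from the full-data table.

```r
# First ten rows, copied from the str() output above.
wine_head <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Group the volatile acidity values by quality score.
groups <- split(wine_head$volatile.acidity, wine_head$quality)

# One row per quality level; var() and sd() return NA for a single-sample group.
by_quality <- data.frame(
  quality                             = as.integer(names(groups)),
  Mean_Volatile_Acidity               = sapply(groups, mean),
  Variance_Volatile_Acidity           = sapply(groups, var),
  Standard_Deviation_Volatile_Acidity = sapply(groups, sd),
  row.names = NULL
)
by_quality
```

The standard deviation column is simply the square root of the variance column, which is why the two always move together in the table.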

The positive correlation between free SO2 and total SO2 is summarized below:

Most of the points seem to be clustered around 0-20 mg/dm3 free SO2 and 0-50 mg/dm3 total SO2.

Let us find out how residual sugar and quality are related.

Except for quality 3, the other quality ratings show a higher density of residual sugar. However, no pattern is observed that would help us predict the quality of wine from residual sugar, so it is not a good attribute for classifying wine quality.

Let us see the relation between fixed acidity and citric acid.

Since citric acid is one of the components of fixed acidity, the two exhibit a significant positive correlation.

A positive correlation is observed. Also, for fixed acidity in the range 6 - 12.5, we observe very little deviation from the mean, while outside that range the deviation is significant.

A significant negative correlation is observed (since pH decreases as acidity increases), and most of the data is clustered in the fixed acidity range 5 - 14 g/dm3.

The data is clustered in the middle, with some points scattered around the plot, and exhibits a negative correlation.

Let us plot some box plots against quality cut to observe the outliers.

pH

Most of the outliers seem to lie in the quality range (4,6].

Density

Most of the outliers seem to lie in the quality range (4,6], which is something we observed for pH too.

Alcohol

In this plot, outliers exist only in quality cut (4,6].

Citric Acid

This box plot seems to have the fewest outliers compared to the other plots, and since quality and citric acid are positively correlated, this feature might be used for predicting quality.

Fixed Acidity

Free SO2

Both of the plots above seem to contain outliers in every quality cut range.

From all the box plots we have seen, it seems that quality cut (4,6] exhibits most of the outliers across features, which is not good for prediction models. This may be because most of the data lies in this range, as observed from the scatterplot matrix and the bar plot of quality cut.
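To make the outlier counts concrete, here is a small sketch that lists box-plot outliers per quality bin, assuming the usual rule of flagging points more than 1.5 times the interquartile range beyond the hinges (the default of `boxplot.stats()`). The pH readings are hypothetical values invented for illustration, not rows from the data set.

```r
# Hypothetical pH readings in two quality bins (not taken from the dataset).
ph <- c(3.30, 3.35, 3.28, 3.40, 3.32,
        3.10, 3.20, 3.25, 3.30, 3.31, 3.33, 3.35, 3.40, 3.45, 3.90)
quality_cut <- factor(rep(c("(0,4]", "(4,6]"), times = c(5, 10)))

# boxplot.stats()$out holds the points beyond 1.5 * IQR from the hinges,
# which are the points a box plot draws individually.
outliers_by_bin <- lapply(split(ph, quality_cut),
                          function(x) boxplot.stats(x)$out)

# Number of flagged points in each bin.
sapply(outliers_by_bin, length)
```

Counting outliers per bin alongside the bin sizes is one way to check whether (4,6] has more outliers simply because it holds most of the samples.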

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • Fixed acidity and citric acid are significantly correlated.
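  • Alcohol content is positively correlated with quality, although quality levels 3 and 5 show higher density readings in the plot.
  • Mean volatile acidity falls as quality rises, although quality levels 5, 7 and 8 show higher density readings in the plot.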
  • Wine samples with less density have high alcohol content.
  • Density of wine varies more for fixed acidity more than 12.5 and less than 6.
  • Residual sugar cannot be used to classify quality of wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Chlorides and sulphates do not exhibit any significant relationship with any other feature. Volatile acidity, however, seems to exhibit a weak positive correlation with pH, which is unexpected. Also, most of the outliers are in the quality range (4,6], which is not good for prediction models.

What was the strongest relationship you found?

Strong relationships that I observed are:

  • Positive :
    • Fixed acidity - density
    • Free SO2 - total SO2
    • Alcohol - quality
  • Negative :
    • Fixed acidity - pH
    • Volatile acidity - citric acid

Multivariate Plots Section

##   quality.cut Mean_Alcohol Median_Alcohol
## 1       (0,4]     10.21587           10.0
## 2       (4,6]     10.25272           10.0
## 3      (6,10]     11.51805           11.6

Something interesting emerges from this chart: good wines are concentrated where citric acid is above 0.3 and alcohol is above 10.5. That is, when both reach certain levels, quality scores are higher.

##   quality.cut Mean_Volatile_Acidity Median_Volatile_Acidity
## 1       (0,4]             0.7242063                    0.68
## 2       (4,6]             0.5385595                    0.54
## 3      (6,10]             0.4055300                    0.37

The graph shows us that good wines have citric acid above 0.27 g/dm3 and volatile acidity below 0.5 g/dm3.

##   quality.cut Mean_Fixed_Acidity Median_Fixed_Acidity
## 1       (0,4]           7.871429                  7.5
## 2       (4,6]           8.254284                  7.8
## 3      (6,10]           8.847005                  8.7

Wines in the quality range (4,6] lie within 0.995 to 1.000 g/cm3 for density and 6.5 to 9 g/dm3 for fixed acidity. Good wines seem to be distributed across the plot.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Good wines are concentrated where citric acid is above 0.3 and alcohol is above 10.5, and also where citric acid is above 0.27 g/dm3 and volatile acidity is below 0.5 g/dm3.

The following were also observed:

  • The mean and median of alcohol are highest for quality cut (6,10].
  • The mean and median of volatile acidity are highest for quality cut (0,4], even though the smallest quality rating in the data is 3.
  • The mean and median of fixed acidity are highest for quality cut (6,10].

Were there any interesting or surprising interactions between features?

An interesting interaction is observed: good wines are concentrated where citric acid is above 0.3 and alcohol is above 10.5. That is, when both reach certain levels, quality scores are higher.
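One way to check this kind of joint threshold is to cross-tabulate a "both high" flag against quality.cut. The sketch below uses hypothetical rows invented for illustration (they are not from the data set), and the column name `both_high` is an assumption; only the thresholds 0.3 and 10.5 come from the text.

```r
# Hypothetical rows for illustration only; they are not taken from the dataset.
wines <- data.frame(
  citric.acid = c(0.10, 0.35, 0.45, 0.05, 0.40, 0.50, 0.20, 0.32),
  alcohol     = c(9.4, 11.2, 12.0, 9.8, 10.0, 11.5, 11.0, 10.8),
  quality     = c(5, 7, 7, 4, 5, 8, 6, 6)
)
wines$quality.cut <- cut(wines$quality, breaks = c(0, 4, 6, 10))

# TRUE only when both features exceed the thresholds quoted in the text.
wines$both_high <- wines$citric.acid > 0.3 & wines$alcohol > 10.5

# Rows: threshold flag. Columns: quality bin.
counts <- table(both_high = wines$both_high, quality.cut = wines$quality.cut)
counts

# Share of each quality bin within the flagged and unflagged groups.
prop.table(counts, margin = 1)
```

Applied to the real data, a noticeably larger share of (6,10] in the TRUE row than in the FALSE row would support the observation.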


Final Plots and Summary

Plot One

Description One

In both figures, at the 95% confidence level, the prediction range for the quality range (0,4] is wider than for the remaining quality ranges, which might be because there is less data to train the classifier.
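The effect of sample size on interval width can be sketched with the bin counts from the summary() output. This is a simplified illustration using a 95% t-interval for a mean, not the smoothing method behind the plots, and it assumes (purely for comparison) that every bin has the same standard deviation `s`.

```r
# Bin sizes copied from the summary() output above.
n <- c("(0,4]" = 63, "(4,6]" = 1319, "(6,10]" = 217)

# Assumption for illustration: every bin has the same sample standard deviation.
s <- 1

# Half-width of a 95% t-interval for a mean: t(0.975, n - 1) * s / sqrt(n).
half_width <- qt(0.975, df = n - 1) * s / sqrt(n)
round(half_width, 3)
```

Under this assumption the smallest bin gets the widest interval, because the half-width shrinks roughly in proportion to 1 / sqrt(n).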

Plot Two

Description Two

Citric acid adds freshness to the wine, which we can see in the plot: the median of citric acid increases as quality increases.

Plot Three

Description Three

In each step we can see the negative influence of volatile acidity on a wine’s quality score.


Reflection

There are many other factors that contribute to wines of good quality. Many of them are not chemical properties like those in our dataset but relate to smell and flavour. Although our variables explain part of what we see, there are also cases where there must be other explanations for the quality levels.

However, within our limitations, we found that alcohol and acidity are very important for a wine to be free of defects, and we also discovered the very interesting concept of taste balance between alcohol and acidity.
